Pilot-Abstraction: A Valid Abstraction for Data-Intensive Applications on HPC, Hadoop and Cloud Infrastructures?
Authors
Abstract
HPC environments have traditionally been designed to meet the compute demands of scientific applications; data has only been a second-order concern. With science moving toward data-driven discovery, relying more and more on correlations in data to form scientific hypotheses, the limitations of existing HPC approaches become apparent: architectural paradigms such as the separation of storage and compute are not optimal for I/O-intensive workloads (e.g. data preparation, transformation and SQL workloads). While there are many powerful computational and analytical kernels and libraries available on HPC (e.g. for scalable linear algebra), they generally lack the usability and variety of analytical libraries found in other environments (e.g. the Apache Hadoop ecosystem). Further, there is a lack of abstractions that unify access to increasingly heterogeneous infrastructure (HPC, Hadoop, clouds) and allow reasoning about performance trade-offs in these complex environments. At the same time, the Hadoop ecosystem is evolving rapidly with new frameworks for data processing; it has established itself as the de facto standard for data-intensive workloads in industry and is increasingly used to tackle scientific problems. In this paper, we explore paths to interoperability between Hadoop and HPC, examine the differences and challenges, such as the different architectural paradigms and abstractions, and investigate ways to address them. We propose the extension of the Pilot-Abstraction to Hadoop to serve as an interoperability layer for allocating and managing resources across different infrastructures, providing a degree of unification in the concepts and implementation of resource management across HPC, Hadoop and other infrastructures. For this purpose, we integrate Hadoop compute and data resources (i.e. YARN and HDFS) with the Pilot-Abstraction. In-memory capabilities have been successfully deployed to enhance the performance of large-scale data analytics (e.g. iterative machine learning algorithms), for which the ability to re-use data across iterations is critical. As memory naturally fits the Pilot concept of retaining resources for a set of tasks, we propose the extension of the Pilot-Abstraction to in-memory resources. These enhancements to the Pilot-Abstraction have been implemented in BigJob. Further, we validate the abstractions using experiments on cloud and HPC infrastructures, investigating the performance of the Pilot-Data and Pilot-Hadoop implementations, of HDFS and Lustre for Hadoop MapReduce workloads, and of Pilot-Data Memory for KMeans clustering. Using Pilot-Hadoop, we evaluate the performance of Stampede, a compute-centric resource, and Gordon, a resource designed for data-intensive workloads that provides additional memory and flash storage. Our benchmarks of Pilot-Data Memory show a significant improvement compared to the file-based Pilot-Data for KMeans, with a measured speedup of 212.
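To make the resource-management model concrete, the sketch below illustrates the Pilot pattern described in the abstract: a pilot is acquired once on a target resource (here a hypothetical YARN endpoint), and individual tasks (compute units) are then scheduled into it by the pilot agent rather than by the cluster's resource manager. This is a minimal sketch in the spirit of the BigJob Pilot-API; the module, class names, URL schemes and description fields are illustrative assumptions, not the verbatim BigJob interface.

# Minimal Pilot-Abstraction sketch (illustrative; names and fields are assumptions).
from pilot import PilotComputeService, ComputeDataService   # assumed Pilot-API module

COORDINATION_URL = "redis://localhost:6379"                  # coordination backend (assumption)

pilot_service = PilotComputeService(coordination_url=COORDINATION_URL)

# Acquire a pilot on a Hadoop/YARN resource; an HPC resource would instead use a
# batch-system URL (e.g. an SSH/SLURM endpoint on Stampede or Gordon).
pilot_description = {
    "service_url": "yarn://headnode.example.org",            # hypothetical endpoint
    "number_of_processes": 16,
    "walltime": 60,                                          # minutes
    "working_directory": "/tmp/pilot",
}
pilot_service.create_pilot(pilot_compute_description=pilot_description)

cds = ComputeDataService()
cds.add_pilot_compute_service(pilot_service)

# Submit tasks (compute units) into the pilot; they run inside the already
# allocated resources, so no further requests hit YARN or the batch queue.
for i in range(32):
    cds.submit_compute_unit({
        "executable": "/bin/echo",
        "arguments": ["task", str(i)],
        "number_of_processes": 1,
        "output": "stdout-%d.txt" % i,
    })

cds.wait()               # block until all compute units have completed
cds.cancel()
pilot_service.cancel()

The same pattern extends to Pilot-Data, where a pilot retains storage (HDFS, Lustre, or memory) instead of compute slots; retaining in-memory resources across tasks is what enables the data re-use across KMeans iterations reported in the evaluation.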
Similar Articles
Hadoop on a Low-Budget General Purpose HPC Cluster in Academia
In the last decade, we witnessed an increasing interest in High Performance Computing (HPC) infrastructures, which play an important role in both academic and industrial research projects. At the same time, due to the increasing amount of available data, we also witnessed the introduction of new frameworks and applications based on the MapReduce paradigm (e.g., Hadoop). Traditional HPC systems ...
Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations, because the rapid development in the use of information technology in general, and network technology in particular, has led many organizations to make their applications available for use via electronic platforms hos...
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
Traditional High Performance Computing (HPC) resources, such as those available on the TeraGrid, support batch job submissions using Distributed Resource Management Systems (DRMS) like TORQUE or the Sun Grid Engine (SGE). For large-scale data intensive computing, programming paradigms such as MapReduce are becoming popular. A growing number of codes in scientific domains such as Bioinformatics ...
An Efficient Scheduling of HPC Applications on Geographically Distributed Cloud Data Centers
Cloud computing provides a flexible infrastructure for IT industries to run their High Performance Computing (HPC) applications. Cloud providers deliver such computing infrastructures through a set of data centers called a cloud federation. The data centers of a cloud federation are usually distributed over the world. The profit of cloud providers strongly depends on the cost of energy consumpt...
High-Performance Storage Support for Scientific Big Data Applications on the Cloud
This work studies the storage subsystem for scientific big data applications to be running on the cloud. Although cloud computing has become one of the most popular paradigms for executing data-intensive applications, the storage subsystem has not been optimized for scientific applications. In particular, many scientific applications were originally developed assuming a tightly-coupled cluster ...
Journal: CoRR
Volume: abs/1501.05041
Pages: -
Publication date: 2015